Abstract

We consider a variation of the Wyner-Ziv problem pertaining to lossy compression
of individual sequences using finite-state encoders and decoders. There are two main
results in this paper. The first characterizes the relationship between the performance of
the best $M$-state encoder-decoder pair and that of the best block code of size $\ell$ for every
input sequence, and shows that the loss of the latter relative to the former (in terms
of both rate and distortion) never exceeds the order of $(\log M)/\ell$, independently of the
input sequence. Thus, in the limit of large $M$, the best rate-distortion performance of
every infinite source sequence can be approached universally by a sequence of block codes
(which are also implementable by finite-state machines). While this result assumes an
asymptotic regime where the number of states is fixed, and only the length $n$ of the
input sequence grows without bound, we then consider the case where the number of
states $M = M_n$ is allowed to grow concurrently with $n$. Our second result is then about
the critical growth rate of $M_n$ such that the rate-distortion performance of $M_n$-state
encoder-decoder pairs can still be matched by a universal code. We show that this
critical growth rate of $M_n$ is linear in $n$.



Index Terms: Finite-state machines, individual sequences, side information, block 
codes, universal coding, Wyner-Ziv problem. 

1 Introduction 

In a series of papers from the late seventies until the mid-eighties, Ziv [11], [12], [13], and Ziv
and Lempel [14], [4] developed a theory of universal compression of individual sequences
using finite-state machines (FSM's). In particular, the work [11] focuses on universal,
fixed-rate, (almost) lossless compression of individual sequences using finite-state encoders
and decoders, which was then further developed into the well-known Lempel-Ziv algorithm
[14], [4]. In [12], the framework of [11] was extended to lossy compression, and in [13], the
results of [11] were extended in another direction, pertaining to almost lossless compression
in the presence of an (individual) side information sequence at the decoder, namely, an
analogue to Slepian-Wolf coding [7] for individual sequences.

In this work, we take yet another step in this direction and further generalize this
model setting, of universal coding for individual sequences using finite-state encoders and
decoders, to that of lossy compression in the presence of side information at the decoder, in
other words, Wyner-Ziv (W-Z) coding [10] for individual sequences. Also, unlike the fixed-rate
codes assumed in [11], [12], [13], here our model allows variable-rate coding, which gives
rise to considerably more flexibility. On the other hand, in our model, the side information
sequence at the decoder is assumed to be generated from the individual source sequence (to
be compressed) via a known memoryless channel, in contrast to [13], where it is modelled
as another individual sequence.$^1$ Furthermore, our model setting can also be viewed as an
extension of the setting of universal finite-state denoising of individual sequences corrupted
by stochastic noise (cf. [9], [6] and references therein): the denoising problem is actually a
special case of this W-Z model, where the coding rate is zero.

There are two main results in this paper. The first result is a characterization of the
relationship between the performance of the best $M$-state finite-state encoder-decoder pair
(for a given input sequence) and that of the best block code of size $\ell$ for every input sequence,
and it shows that the loss of the latter relative to the former (in terms of both rate and
distortion) is of the order of $(\log M)/\ell$, independently of the input sequence. Thus, in the
limit of large $M$, the best rate-distortion performance of every infinite source sequence can
be approached universally by a sequence of block codes (which are also implementable by
finite-state machines). One of the interesting features of these universal codes is that they
require no binning, as opposed to the well-known W-Z code in the classical probabilistic
case [10]. We also extend this result to the framework of successive refinement coding (cf. e.g.,
[2], [5], [8]), where there are two encoders and two decoders (all of which are finite-state
machines): the first encoder transmits a relatively coarse description to the first decoder,
which also has access to a certain side information stream. The second encoder sends a
refinement code to another decoder that also has access to the first compressed bitstream,
as well as to another side information sequence. This setup is in analogy to a recent study
on successive refinement coding for the W-Z problem for probabilistic memoryless sources
[8]. However, in contrast to [8], where a certain Markov structure had to be assumed
regarding the source and the two side information processes, here no such structure is
needed. Also, unlike in [8], here the extension to multiple stages is straightforward.

$^1$ The reason for this assumption is that even in the classical, probabilistic setting, W-Z coding cannot
be universal w.r.t. the channel from the source to the side information stream (unless there is feedback) as
the encoder, which has no access to the side information, cannot 'learn' its statistics. This is different from
the Slepian-Wolf setting, where the encoder does not depend on these statistics.

Returning to the single-stage coding model, we next relax the assumption that the number
of states is fixed, independently of the length $n$ of the input data. In other words, we
examine an asymptotic regime that allows the number of states $M = M_n$ to grow concurrently
with $n$, and we investigate the critical growth rate of $M_n$ such that the rate-distortion
performance of $M_n$-state encoder-decoder pairs can still be matched by a universal code
(in a sense to be made precise later on). Our second result is that this critical growth rate
of $M_n$ is linear in $n$, in the sense that if $M_n = n^\theta$, universal achievability is guaranteed for
all $\theta < 1$, but not for $\theta > 1$. In other words, $\theta = 1$ is the critical value of $\theta$.

In this context, it is interesting to go back, for a moment, to the lossless case without side
information and to consider the performance of the well-known Lempel-Ziv algorithm in
that respect. The converse to the coding theorem (Theorem 1) in [14]
states that the best compression achievable by an FSM with $M$ states is lower bounded by
a quantity that behaves roughly like $c\log(c/(4M^2))$, where $c$ is the number of distinct phrases
in the input sequence. Since the length of the LZ code is about $c\log c$, the gap, $c\log(4M^2)$,
would be relatively negligible only as long as $\log M$ is very small compared to
$\log c \sim O(\log n)$, which is guaranteed to be the case when $\theta$ is very small ($\theta \ll 1$). It
therefore turns out that there is a gap between the best that can be done by a universal
code, where $\theta$ can be chosen arbitrarily close to unity, and the performance of the LZ
algorithm in that respect.
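
As a rough numerical illustration of the last point (a sketch, not part of the original analysis): with $M = n^\theta$, the relative gap $\log(4M^2)/\log c$ is roughly $2\theta$ whenever $\log c$ is close to $\log n$. The function name `lz_relative_gap`, the use of base-2 logarithms, and the phrase count $c = n/\log_2 n$ in the demo loop are assumptions made only for this example.

```python
import math

def lz_relative_gap(n: int, theta: float, c: float) -> float:
    """Ratio log2(4 M^2) / log2(c) for M = n**theta (all logarithms base 2)."""
    log_gap = 2.0 + 2.0 * theta * math.log2(n)  # log2(4 * M^2) with M = n^theta
    return log_gap / math.log2(c)

if __name__ == "__main__":
    n = 2 ** 20
    c = n / math.log2(n)  # assumed phrase count, of the order n / log n
    for theta in (0.01, 0.1, 0.5, 0.9):
        print(theta, round(lz_relative_gap(n, theta, c), 3))
```

The ratio stays small only for small $\theta$, which matches the observation that the LZ guarantee degrades well before $\theta$ approaches unity.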

The outline of this paper is as follows. In Section 2, we define the problem and establish 
notation conventions. Section 3 is devoted to the derivation of the first result described 
above, pertaining to a fixed number of states. Finally, Section 4 focuses on the critical 
growth rate of the number of states. The extension of the results of Section 3 to successive 
refinement coding is deferred to the Appendix, for the sake of continuity between Sections 
3 and 4 which are both about single-stage codes. 
2 Notation and Problem Formulation



Throughout the paper, random variables will be denoted by capital letters, specific values
they may take will be denoted by the corresponding lower case letters, and their alphabets,
as well as some other sets, will be denoted by calligraphic letters. Similarly, random vectors,
their realizations, and their alphabets will be denoted, respectively, by capital letters,
the corresponding lower case letters, and calligraphic letters, all superscripted by their
dimensions. For example, the random vector $Y^n = (Y_1, \ldots, Y_n)$ ($n$ a positive integer) may
take a specific vector value $y^n = (y_1, \ldots, y_n)$ in $\mathcal{Y}^n$, the $n$th order Cartesian power of $\mathcal{Y}$,
which is the alphabet of each component of this vector. For $i < j$ ($i$, $j$ positive integers),
$x_i^j$ will denote the segment $(x_i, \ldots, x_j)$, where for $i = 1$ the subscript will be omitted.

Let $\boldsymbol{x} = (x_1, x_2, \ldots)$, $x_i \in \mathcal{X}$, $i = 1, 2, \ldots$, with $|\mathcal{X}| = \alpha < \infty$, be an infinite input
sequence (to be compressed) and let $\boldsymbol{y} = (y_1, y_2, \ldots)$, $y_i \in \mathcal{Y}$, $i = 1, 2, \ldots$, with $|\mathcal{Y}| = \beta < \infty$,
be a corresponding side-information sequence generated by a given discrete memoryless
channel (DMC)

$$P(y_1, \ldots, y_n | x_1, \ldots, x_n) = \prod_{i=1}^{n} P(y_i | x_i), \qquad n = 1, 2, \ldots \tag{1}$$

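When the sequence $\boldsymbol{x}$ is sequentially fed into a variable-rate finite-state encoder $\mathcal{E} = (\mathcal{S}, f, g)$,
this encoder generates an infinite sequence of binary strings of variable length,
$\boldsymbol{u} = (u_1, u_2, \ldots)$, while going through an infinite sequence of states $s_1, s_2, \ldots$ according to

$$\begin{aligned} u_i &= f(s_i, x_i), \\ s_{i+1} &= g(s_i, x_i), \qquad i = 1, 2, \ldots \end{aligned} \tag{2}$$

where the initial state $s_1$ is assumed to be a certain fixed member of $\mathcal{S}$. At the same time
and in a similar manner, the finite-state decoder $\mathcal{D} = (\mathcal{S}', f', g')$ sequentially maps $\boldsymbol{u}$ and
$\boldsymbol{y}$ to an infinite reproduction sequence $\hat{x}_1, \hat{x}_2, \ldots$, $\hat{x}_i \in \hat{\mathcal{X}}$, $i = 1, 2, \ldots$, with $|\hat{\mathcal{X}}| = \gamma < \infty$,
using the recursion

$$\begin{aligned} \hat{x}_{i-d} &= f'(s'_i, u_i, y_i), \qquad i = d+1, d+2, \ldots \\ s'_{i+1} &= g'(s'_i, u_i, y_i), \qquad i = 1, 2, \ldots \end{aligned} \tag{3}$$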

where $d$ (a positive integer) is the encoding-decoding delay, and the initial state $s'_1$ is assumed
to be a certain fixed member of $\mathcal{S}'$. It is assumed that at each time instant $i$, when the
decoder is at state $s'_i$, it is able to isolate the current input codeword $u_i$ from the following
codewords of the compressed bitstream, $u_{i+1}, u_{i+2}, \ldots$.$^2$ To this end, we allow a prefix code
$\mathcal{C}(s')$ associated with each $s' \in \mathcal{S}'$ (with the option that for some $s' \in \mathcal{S}'$, $\mathcal{C}(s')$ may be
empty, in which case the decoder idles). For every $u \in \mathcal{C}(s')$, let $L(u)$ denote the length of
$u$ (in bits). We then assume that the Kraft inequality

$$\sum_{u \in \mathcal{C}(s')} 2^{-L(u)} \le 1 \tag{4}$$

holds for all $s' \in \mathcal{S}'$. Note that the above discussion continues to apply when single codewords
are replaced by $\ell$-vectors, $u^\ell$, formed by concatenating $\ell$ (legitimate) codewords
successively. In this case, let $\mathcal{C}^\ell(s')$ denote the supercode formed by all $\{u^\ell\}$ that originate
from state $s'$. Clearly, since the components of $u^\ell$ can be identified recursively, the supercode
satisfies the Kraft inequality as well w.r.t. the length function $L(u^\ell) = \sum_{i=1}^{\ell} L(u_i)$.
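
The recursions (2) and (3) and the Kraft condition (4) can be made concrete with a small simulation. The following Python sketch is purely illustrative: the toy encoder (send $x_i$ as one bit at odd times, nothing at even times), the toy decoder with delay $d = 1$, the binary symmetric channel, and all function names are assumptions chosen for this example and are not taken from the paper. An idle step is modelled here as a code containing only the zero-length codeword.

```python
import random

def kraft_sum(codewords):
    """Sum of 2^{-L(u)} over the codewords of one prefix code C(s')."""
    return sum(2.0 ** -len(u) for u in codewords)

def dmc(x, flip_prob, rng):
    """Binary symmetric channel, a toy instance of the DMC in (1)."""
    return [xi ^ (rng.random() < flip_prob) for xi in x]

def run_encoder(x, f, g, s1):
    """Apply u_i = f(s_i, x_i) and s_{i+1} = g(s_i, x_i), as in (2)."""
    s, out = s1, []
    for xi in x:
        out.append(f(s, xi))
        s = g(s, xi)
    return out

def run_decoder(u, y, f_dec, g_dec, s1, d):
    """Apply xhat_{i-d} = f'(s'_i, u_i, y_i) for i > d and s'_{i+1} = g'(s'_i, u_i, y_i), as in (3)."""
    s, xhat = s1, []
    for i, (ui, yi) in enumerate(zip(u, y), start=1):
        if i > d:
            xhat.append(f_dec(s, ui, yi))
        s = g_dec(s, ui, yi)
    return xhat

# Toy 2-state encoder: the state is the time parity.
def f_enc(s, x):
    return str(x) if s == 0 else ""

def g_enc(s, x):
    return 1 - s

# Toy decoder, delay d = 1: state = (parity, previous codeword, previous side-information symbol).
def f_dec(s, u, y):
    _, prev_u, prev_y = s
    return int(prev_u) if prev_u else prev_y  # use the bit if one was sent, else the side information

def g_dec(s, u, y):
    return (1 - s[0], u, y)

def demo(n=12, flip_prob=0.1, seed=0):
    rng = random.Random(seed)
    x = [rng.randint(0, 1) for _ in range(n)]
    y = dmc(x, flip_prob, rng)
    u = run_encoder(x, f_enc, g_enc, 0)
    xhat = run_decoder(u, y, f_dec, g_dec, (0, "", 0), d=1)
    rate = sum(len(ui) for ui in u) / n                       # bits per source symbol
    dist = sum(a != b for a, b in zip(x, xhat)) / len(xhat)   # empirical Hamming distortion
    return x, y, u, xhat, rate, dist
```

In this toy pair the decoder knows from the parity component of its state whether a one-bit codeword or an empty one arrives next, which is the role played by the state-dependent prefix code $\mathcal{C}(s')$.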

For a given single-letter distortion measure $\rho : \mathcal{X} \times \hat{\mathcal{X}} \to \mathbb{R}^+$, let $\Delta(x^n, \mathcal{E}, \mathcal{D})$ denote
the expected distortion $\frac{1}{n} \sum_{i=1}^{n} E\rho(x_i, \hat{X}_i)$ associated with encoding and decoding $x^n$
by $(\mathcal{E}, \mathcal{D})$, where the expectation is w.r.t. the DMC. Now, define

$$\Delta_{M,d}(x^n, R) = \min \Delta(x^n, \mathcal{E}, \mathcal{D}) \tag{5}$$

where the minimum is over all encoder-decoder pairs $\{(\mathcal{E}, \mathcal{D})\}$ having no more than $M$
states each, with delay no longer than $d$, and which satisfy the rate constraint:

$$\sum_{i=1}^{n} L(u_i) \le nR. \tag{6}$$

While the optimum $(\mathcal{E}, \mathcal{D})$ for achieving $\Delta_{M,d}(x^n, R)$ depends, in general, on $x^n$, we are
interested in a universal algorithm (independent of $x^n$) that 'competes' with the best
$M$-state encoder-decoder pair $(\mathcal{E}, \mathcal{D})$, and eventually approaches the operational (finite-state)
W-Z distortion-rate function, which we define as:

$$\Delta(\boldsymbol{x}, R) = \lim_{d \to \infty} \lim_{M \to \infty} \limsup_{n \to \infty} \Delta_{M,d}(x^n, R). \tag{7}$$

Note that the order of the limits is first over $M$ and then over $d$. This is because given $M$,
the delay $d$ cannot be arbitrarily large. To implement a finite-state machine with delay $d$,
such a system must store the $d$ most recent inputs, which requires the number of states to be
at least exponential in $d$; that is, the maximum possible delay for a given $M$, which we shall
denote by $d_M$, is proportional to $\log M$. A somewhat stronger notion of the operational
finite-state W-Z distortion-rate function is then given by

$$\Delta^*(\boldsymbol{x}, R) = \lim_{M \to \infty} \limsup_{n \to \infty} \Delta_{M,d_M}(x^n, R). \tag{8}$$

Obviously, $\Delta^*(\boldsymbol{x}, R) \le \Delta(\boldsymbol{x}, R)$.

$^2$ To this end, it would make sense to assume that the dependence of $u_i$ on $s'_i$ is only via a part of
the decoder state information that is independent of the (random) SI sequence, e.g., $s''_i$, which is updated
according to $s''_{i+1} = g''(s''_i, u_i)$.

3 Long Block Codes are as Good as FSM's 

We start by defining the informational W-Z distortion-rate function (as opposed to the
operational definitions of eqs. (7), (8)) in the following manner: Let $n$ and $\ell < n$ be two given
positive integers, assume, without essential loss of generality, that $\ell$ divides $n$, and let us
chop the sequence $x^n$ into $\ell$-blocks, $\{x_{i\ell+1}^{i\ell+\ell}\}_{i=0}^{n/\ell-1}$. Now, let $P(a^\ell)$, $a^\ell = (a_1, \ldots, a_\ell) \in \mathcal{X}^\ell$,
denote the empirical probability (relative frequency) of the $\ell$-vector $a^\ell$ along $x^n$, i.e.,

$$P(a^\ell) = \frac{\ell}{n} \sum_{i=0}^{n/\ell-1} 1\{x_{i\ell+1}^{i\ell+\ell} = a^\ell\} \tag{9}$$

and for any $b^\ell = (b_1, \ldots, b_\ell) \in \mathcal{Y}^\ell$, let

$$P(a^\ell, b^\ell) = P(a^\ell) P(b^\ell | a^\ell) = P(a^\ell) \cdot \prod_{i=1}^{\ell} P(b_i | a_i), \tag{10}$$

where $\{P(b_i | a_i)\}$ are the single-letter transition probabilities associated with the DMC (1).
Let $(X^\ell, Y^\ell)$ designate random $\ell$-vectors jointly distributed according to $\{P(a^\ell, b^\ell),\ a^\ell \in
\mathcal{X}^\ell,\ b^\ell \in \mathcal{Y}^\ell\}$, and define the $\ell$-th order informational W-Z distortion-rate function of the
source $X^\ell$ w.r.t. the SI $Y^\ell$ as follows:

$$\Delta_{X^\ell|Y^\ell}(R) = \frac{1}{\ell} \min E\rho(X^\ell, h(Y^\ell, U)) \tag{11}$$

where the minimum is over all functions $\{h\}$ from $\mathcal{Y}^\ell \times \mathcal{U}$ to $\hat{\mathcal{X}}^\ell$, and over all RV's $\{U\}$
that: (i) take on values in an alphabet $\mathcal{U}$ of size $|\mathcal{X}|^\ell + 1$, (ii) satisfy the Markov relation
$U \to X^\ell \to Y^\ell$, and (iii) satisfy the inequality

$$H(U) \le \ell R. \tag{12}$$
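
The empirical block distribution of (9) and its extension through the DMC in (10) are straightforward to compute. The Python sketch below illustrates both; the function names, the dictionary layout `channel[x][y]` for the transition probabilities, and the binary symmetric channel in the usage note are assumptions made for illustration only.

```python
from collections import Counter
from itertools import product

def empirical_block_dist(x, ell):
    """Relative frequency of each ell-block among the n/ell non-overlapping blocks of x, as in (9)."""
    n = len(x)
    if n % ell != 0:
        raise ValueError("ell must divide n")
    counts = Counter(tuple(x[i:i + ell]) for i in range(0, n, ell))
    blocks = n // ell
    return {a: c / blocks for a, c in counts.items()}

def joint_with_dmc(p_a, channel, y_alphabet):
    """P(a, b) = P(a) * prod_i P(b_i | a_i), as in (10); channel[x][y] holds P(y | x)."""
    joint = {}
    for a, pa in p_a.items():
        for b in product(y_alphabet, repeat=len(a)):
            pb = 1.0
            for ai, bi in zip(a, b):
                pb *= channel[ai][bi]
            joint[(a, b)] = pa * pb
    return joint
```

For example, `empirical_block_dist([0, 1, 0, 1, 1, 1, 0, 1], 2)` counts the blocks (0,1), (0,1), (1,1), (0,1), and `joint_with_dmc` then attaches the channel probabilities of each possible side-information block.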

Our main result in this section relates the operational W-Z distortion-rate function, for
finite $M$ and $d$, to the $\ell$-th order informational distortion-rate function associated with the
empirical distribution of $x^n$, as defined above:

Theorem 1 For all positive integers $n$ and $\ell$ such that $\ell$ divides $n$:

$$\Delta_{M,d}(x^n, R) \ge \Delta_{X^\ell|Y^\ell}\left(R + \frac{2\log M}{\ell}\right) - \frac{\rho_{\max}\, d}{\ell}, \tag{13}$$

where $\rho_{\max} = \max_{x,\hat{x}} \rho(x, \hat{x})$.

Note that the rate-redundancy, $(2\log M)/\ell$, depends on the number of states, whereas
the distortion-redundancy, $\rho_{\max} d/\ell$, depends only on the delay. However, referring to the
relationship between $d$ and $M$ (cf. the last paragraph of Section 2), the distortion-redundancy term
$\rho_{\max} d/\ell$ is also bounded by a quantity proportional to $(\log M)/\ell$, like the rate-redundancy.
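
To get a feeling for the two redundancy terms in Theorem 1, they can be evaluated numerically. The sketch below is illustrative only: the function name is hypothetical, and logarithms are taken to base 2 on the assumption that rates are measured in bits.

```python
import math

def theorem1_redundancy(M: int, d: int, ell: int, rho_max: float):
    """Return (rate redundancy 2*log2(M)/ell, distortion redundancy rho_max*d/ell)."""
    rate_red = 2.0 * math.log2(M) / ell
    dist_red = rho_max * d / ell
    return rate_red, dist_red
```

For instance, with $M = 1024$ states, delay $d = 10$, $\rho_{\max} = 1$ and block length $\ell = 1000$, the rate penalty is 0.02 bits per symbol and the distortion penalty is 0.01; both shrink as $\ell$ grows with $M$ held fixed.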

Defining now the informational W-Z distortion-rate function of the infinite sequence $\boldsymbol{x}$
as

$$\Delta_{X|Y}(R) = \limsup_{\ell \to \infty} \limsup_{n \to \infty} \Delta_{X^\ell|Y^\ell}(R), \tag{14}$$

we have the following corollary to Theorem 1, which means that long block codes are
asymptotically good enough to attain the asymptotic performance of general
finite-state codes:

Corollary 1 For every $\boldsymbol{x}$,

$$\Delta_{X|Y}(R) \ge \Delta^*(\boldsymbol{x}, R) \ge \Delta_{X|Y}(R^+) \triangleq \lim_{R' \downarrow R} \Delta_{X|Y}(R'). \tag{15}$$

Observe that, by monotonicity, $\Delta_{X|Y}(R^+) = \Delta_{X|Y}(R)$ (hence the two inequalities become
equalities) for every $R > 0$, with the possible exception of a countable set of points.
Corollary 1 follows from Theorem 1 in a simple manner: For any given $R' > R$, Theorem
1 implies that $\Delta_{M,d_M}(x^n, R) \ge \Delta_{X^\ell|Y^\ell}(R') - \rho_{\max} d_M/\ell$ for all sufficiently large $\ell$. Taking
first the limsup as $n \to \infty$, next the limsup as $\ell \to \infty$, then the limit as $M \to \infty$, and
finally, the limit $R' \to R$, gives the right inequality. The left inequality follows from the fact
that a block code of length $\ell$ is implementable by an FSM with $\alpha^\ell$ states (cf. [11, Example
2, p. 406]), thus $\Delta_{X^\ell|Y^\ell}(R) \ge \Delta_{\alpha^\ell,d}(x^n, R)$, and the result is obtained by taking first the
limsup over $n$ and then the limsup over $\ell$ at both sides of this inequality.

We now turn to the proof of Theorem 1. 
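Proof. For a given combination of an $M$-state encoder and an $M$-state decoder $(\mathcal{E}, \mathcal{D})$,
consider the joint probability distribution

$$P(a^\ell, c^\ell, s, s') = \frac{\ell}{n} \sum_{i=0}^{n/\ell-1} 1\{x_{i\ell+1}^{i\ell+\ell} = a^\ell,\ u_{i\ell+1}^{i\ell+\ell} = c^\ell,\ s_{i\ell+1} = s,\ s'_{i\ell+1} = s'\}, \tag{16}$$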
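for every $a^\ell \in \mathcal{X}^\ell$, $s \in \mathcal{S}$, $s' \in \mathcal{S}'$, and $c^\ell \in \mathcal{C}^\ell(s')$. Note that $\hat{x}_{i\ell+1-d}^{i\ell+\ell-d}$ depends
(deterministically) only on $y_{i\ell+1}^{i\ell+\ell}$, $u_{i\ell+1}^{i\ell+\ell}$, and $s'_{i\ell+1}$. Let us denote then
$\hat{x}_{i\ell+1-d}^{i\ell+\ell-d} = h(y_{i\ell+1}^{i\ell+\ell}, u_{i\ell+1}^{i\ell+\ell}, s'_{i\ell+1})$,
and assuming that $\ell > d$, let $\hat{x}_{i\ell+1}^{i\ell+\ell-d} = h'(y_{i\ell+1}^{i\ell+\ell}, u_{i\ell+1}^{i\ell+\ell}, s'_{i\ell+1})$ be defined simply by truncating
the first $d$ components of $h(y_{i\ell+1}^{i\ell+\ell}, u_{i\ell+1}^{i\ell+\ell}, s'_{i\ell+1})$. Now, define

$$\begin{aligned} P(a^\ell, b^\ell, c^\ell, \hat{a}^{\ell-d}, s, s') &= P(a^\ell, c^\ell, s, s') P(b^\ell | a^\ell)\, 1\{\hat{a}^{\ell-d} = h'(b^\ell, c^\ell, s')\} \\ &= P(a^\ell, c^\ell, s, s') \left[\prod_{i=1}^{\ell} P(b_i | a_i)\right] 1\{\hat{a}^{\ell-d} = h'(b^\ell, c^\ell, s')\}, \end{aligned} \tag{17}$$

for every $b^\ell \in \mathcal{Y}^\ell$ and $\hat{a}^{\ell-d} \in \hat{\mathcal{X}}^{\ell-d}$. Let $(X^\ell, Y^\ell, U^\ell, \hat{X}^{\ell-d}, S, S')$ designate a random vector
that is distributed according to this joint probability mass function. Now, according to the
rate constraint:

$$\begin{aligned} nR &\ge \sum_{i=1}^{n} L(u_i) \\ &= \sum_{i=0}^{n/\ell-1} L(u_{i\ell+1}^{i\ell+\ell}) \\ &= \frac{n}{\ell} \sum_{s' \in \mathcal{S}'} \sum_{c^\ell \in \mathcal{C}^\ell(s')} P(c^\ell, s') L(c^\ell) \\ &\ge \frac{n}{\ell} \sum_{s' \in \mathcal{S}'} \sum_{c^\ell \in \mathcal{C}^\ell(s')} P(c^\ell, s') \log \frac{1}{P(c^\ell | s')} \end{aligned} \tag{18}$$

where the last step follows from the postulate that $\{L(c^\ell),\ c^\ell \in \mathcal{C}^\ell(s')\}$ satisfy the Kraft
inequality for each $s'$. It follows then that $H(U^\ell | S') \le \ell R$. Therefore,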

$$\begin{aligned} \ell R &\ge H(U^\ell | S') \\ &= H(U^\ell, S') - H(S') \\ &\ge H(U^\ell, S') - \log M. \end{aligned} \tag{19}$$

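Next, observe that $u_{i\ell+1}^{i\ell+\ell}$ depends solely on $s_{i\ell+1}$ and $x_{i\ell+1}^{i\ell+\ell}$ but not on $y_{i\ell+1}^{i\ell+\ell}$, and so, $U^\ell \to
(X^\ell, S) \to Y^\ell$ is a Markov chain. But since $S \to X^\ell \to Y^\ell$ is also a Markov chain (due to
the DMC), then so is $(U^\ell, S) \to X^\ell \to Y^\ell$. In addition, $\hat{X}^{\ell-d} = h'(Y^\ell, U^\ell, S')$. It therefore
follows that

$$\begin{aligned} \frac{1}{n} E\rho(x^n, \hat{X}^n) &\ge \frac{1}{\ell} E\rho(X^{\ell-d}, \hat{X}^{\ell-d}) \\ &\ge \frac{1}{\ell} \left[E\rho(X^\ell, \hat{X}^\ell) - d \cdot \rho_{\max}\right] \end{aligned} \tag{20}$$

where $\hat{X}^\ell$ is defined by concatenating $\hat{X}^{\ell-d}$ with a random $d$-vector in $\hat{\mathcal{X}}^d$ that is an
arbitrary function of $(Y^\ell, U^\ell, S')$. Next, observe that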

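$$\begin{aligned} \ell R &\ge H(U^\ell, S') - \log M \\ &\ge H(U^\ell, S, S') - 2\log M, \end{aligned} \tag{21}$$

that is,

$$H(U^\ell, S, S') \le \ell\left(R + \frac{2\log M}{\ell}\right), \tag{22}$$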

and the Markovity of the chain $(U^\ell, S) \to X^\ell \to Y^\ell$ implies Markovity of $(U^\ell, S, S') \to
X^\ell \to Y^\ell$. Moreover, for the purpose of deriving a lower bound, let us also allow $h$ to
depend on $S$ too, i.e., $\hat{X}^\ell = h(Y^\ell, U^\ell, S, S')$. We now have the following lower bound on
$\Delta_{M,d}(x^n, R)$:

$$\Delta_{M,d}(x^n, R) \ge \frac{1}{\ell} \min E\rho(X^\ell, h(Y^\ell, U^\ell, S, S')) - \frac{\rho_{\max}\, d}{\ell} \tag{23}$$

where the minimum is subject to the constraints:

$$H(U^\ell, S, S') \le \ell\left(R + \frac{2\log M}{\ell}\right) \tag{24}$$

and

$$(U^\ell, S, S') \to X^\ell \to Y^\ell \ \text{is a Markov chain.} \tag{25}$$

Now, observe that in this minimization problem $U^\ell$, $S$, and $S'$ always appear together. Let
us define then $U = (U^\ell, S, S')$ and further reduce this expression by taking the minimum
of the distortion over all $(U, h)$ subject to the constraints $H(U) \le \ell(R + 2\log M/\ell)$ and
$U \to X^\ell \to Y^\ell$, which is $\Delta_{X^\ell|Y^\ell}(R + 2\log M/\ell)$ by definition. This completes the proof of
Theorem 1. □

We now describe a (universal) block coding scheme that asymptotically (for large $\ell$)
achieves $\Delta_{X|Y}(R)$ and hence also $\Delta^*(\boldsymbol{x}, R)$ (for almost all values of $R$): Given $x^n$, compute
its $\ell$-th order empirical distribution, $\{P(a^\ell),\ a^\ell \in \mathcal{X}^\ell\}$ (which is the distribution of $X^\ell$), and find the
RV $U$ and the function $h$ that achieve $\Delta_{X^\ell|Y^\ell}(R)$. The (stochastic) encoder applies the
channel $P(U|X^\ell)$ to every $\ell$-vector $x_{i\ell+1}^{i\ell+\ell}$ and then performs entropy coding according to
the marginal of $U$, after transmitting a header that describes the entropy coding rule of the
optimum $U$ (which depends only on the marginal of $U$) and the function $h$. Together, these
require no more than $\lceil \log(n/\ell + 1)^{\alpha^\ell} \rceil$ bits (the logarithm of an upper bound on the number of
different empirical distributions of superletters formed by $\ell$-vectors). The decoder first decodes the header,
then $U$, and finally reconstructs the source by applying $\hat{X}^\ell = h(Y^\ell, U)$. The rate is upper
bounded by

$$\frac{1}{n}\left\{\left\lceil \log\left(\frac{n}{\ell} + 1\right)^{\alpha^\ell} \right\rceil + \frac{n}{\ell}[H(U) + 1]\right\} \le R + \frac{1}{\ell} + \frac{\alpha^\ell}{n} \log\left(\frac{n}{\ell} + 1\right) + \frac{1}{n} \tag{26}$$

where we have used the fact that $H(U) \le \ell R$ and bounded the ceiling by adding one bit. Clearly,
if $\ell$ is large and $n \gg \alpha^\ell$, this is
arbitrarily close to $R$. The distortion $\Delta_{X^\ell|Y^\ell}(R)$ is maintained by definition.
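
The overhead of the block coding scheme relative to $R$ can be tabulated directly from the header size and the per-block entropy-coding loss. The Python sketch below does so under the assumptions that logarithms are base 2 and that $H(U) = \ell R$ (the worst case allowed by the constraint); the function names are hypothetical.

```python
import math

def header_bits(n: int, ell: int, alpha: int) -> int:
    """ceil(log2((n/ell + 1) ** (alpha ** ell))): bits used to index the empirical distribution."""
    return math.ceil(alpha ** ell * math.log2(n / ell + 1))

def rate_upper_bound(n: int, ell: int, alpha: int, R: float) -> float:
    """Header rate plus (1/ell) * (ell * R + 1), i.e. the bound on the rate when H(U) = ell * R."""
    return header_bits(n, ell, alpha) / n + R + 1.0 / ell
```

For a binary source with $n = 2^{20}$ and $\ell = 4$, the header costs 289 bits, which is negligible per symbol, while the $1/\ell$ term dominates the overhead; increasing $\ell$ reduces that term but makes the header grow like $\alpha^\ell$.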

Note that the complexity of this scheme is mostly in the optimization over $h$ and $U$, which
is not negligible, but is a function of $\ell$ only. This is in contrast to the schemes proposed
in [12], [13], which require an exhaustive search over sets of sequences of length $n$ ($\gg \alpha^\ell$),
namely, complexity that grows exponentially with $n$. The stochastic encoder $X^\ell \to U$ can
also be implemented deterministically, but then the encoding complexity will be exponential
in $n$: Select independently at random a set of $M = 2^{(n/\ell)[I(X^\ell;U)+\epsilon]}$ vectors $U^n(i)$, $i =
1, \ldots, M$. Given $x^n$, find a jointly typical vector $U^n(i)$ (in the superalphabet of $\ell$-vectors),
and transmit it using $(n/\ell)[I(X^\ell; U) + \epsilon] \le (n/\ell)[H(U) + \epsilon]$ bits plus a (relatively small)
header that describes the type class of $x^n$, from which the decoder can also figure out $h$ on
its own. By the Markov lemma, with high probability, $(U, X^\ell, Y^\ell)$ will be jointly typical
and hence $\hat{X}^n$ will satisfy the distortion constraint.
Discussion. Three comments are in order: 

1. When $\boldsymbol{x}$ emerges from a discrete memoryless source (DMS), rather than being an
individual sequence, it is well known that the distortion-rate function is the classical
W-Z distortion-rate function for that DMS, $\Delta^{\mathrm{WZ}}_{X|Y}(R)$. Clearly, by analyzing the
performance of block codes (rather than FSM's) on the given DMS, using the same
technique as above, one can show that

$$\Delta^{\mathrm{WZ}}_{X|Y}(R) = \lim_{\ell \to \infty} \Delta_{X^\ell|Y^\ell}(R), \tag{27}$$

where now $X^\ell$ designates a random $\ell$-vector from the source. This is true because for
both sides of this equality, we have a direct theorem and a converse theorem.

2. In view of item no. 1 above and our direct theorem, we have actually shown that
it is possible to approach the W-Z rate-distortion function (of both a DMS and an
individual sequence) without binning.

3. The above result extends to a model of scalable coding (successive refinement), in
analogy to [8]. The details and the discussion appear in the Appendix.

4 Universality and the Critical Growth Rate of M

In the previous section, we have considered a reference class of encoders and decoders with
a fixed number of states, $M$, and a fixed delay, $d$, and we have shown that the (operative)
rate-distortion function w.r.t. this class can be approached by using (sufficiently long) block
codes. In this section, we address a somewhat different question, pertaining to a regime
that allows both the number of states and the delay to grow with $n$, the length of the input
sequence $x^n$. That is, $M = M_n$ and $d = d_n$. For the sake of simplicity, we will be 'generous'
with regard to the delay, allowing it to be maximum, namely, $d_n = d_{M_n}$, and then we focus
on the following question: What is the highest growth rate of $M_n$ as a function of $n$, below
which it is still possible to universally attain (in a sense to be defined soon), using any
general block code for $n$-sequences, the performance of the best encoder-decoder with $M_n$
states and delay $d_{M_n}$?

For the sake of convenience, in this section, instead of confining ourselves to either
rate-distortion functions or distortion-rate functions, we will treat rate and distortion in a
more symmetric fashion by defining achievable pairs $(R, \Delta)$ in the spirit of the definitions
in Section 2: Given $x^n$, a pair $(R, \Delta)$ is said to be $M_n$-achievable if there exists an
$M_n$-state encoder $\mathcal{E} = (\mathcal{S}, f, g)$ and an $M_n$-state decoder $\mathcal{D} = (\mathcal{S}', f', g')$, with overall delay not
exceeding $d_n = d_{M_n}$, such that $\sum_{i=1}^{n} L(u_i) \le nR$ and $\sum_{i=1}^{n} E\rho(x_i, \hat{X}_i) \le n\Delta$. Referring to
the previous definitions, given $x^n$, the pair $(R, \Delta_{M_n,d_{M_n}}(x^n, R))$ is always $M_n$-achievable.

While the definition of an achievable pair $(R, \Delta)$ allows the choice of an encoder-decoder
that depends on $x^n$, we now define the notion of a universally achievable pair $(R, \Delta)$: A
pair $(R, \Delta)$ is said to be universally achievable w.r.t. $\{M_n\}_{n \ge 1}$ if for every $\epsilon > 0$, $\delta > 0$,
and $n$ sufficiently large, there exists an encoder-decoder that achieves rate less than or
equal to $R + \epsilon$ and distortion less than or equal to $\Delta + \delta$ for every $x^n$ for which $(R, \Delta)$ is
$M_n$-achievable.

Let us assume, from now on, that $M_n$ grows asymptotically like some power
of $n$, that is, $\lim_{n \to \infty} (\log M_n)/\log n = \theta$, where $\theta$ is a certain positive real. In this section,
we are interested in the critical value of $\theta$ below which universal achievability of every pair
$(R, \Delta)$ is guaranteed, but above which there exist pairs $(R, \Delta)$ that are not
universally achievable.

The following two theorems tell us that this critical value is $\theta = 1$; in other words, the
critical asymptotic growth rate of $M_n$ is linear.

Theorem 2 (Direct Theorem): If $\theta < 1$, then every $(R, \Delta)$ is universally achievable.

Proof. Assume, for the sake of simplicity and without essential loss of generality, that
$M_n = n^\theta$, $\theta < 1$, and consider the following mechanism for encoding $x^n$: The encoder
examines all possible pairs $\{(\mathcal{E}, \mathcal{D})\}$ with $M_n$ states, on the given $x^n$, and computes the
coding rate and the expected distortion (w.r.t. the randomness of the known channel from
$x^n$ to $y^n$). If it finds an encoder-decoder pair $(\mathcal{E}^*, \mathcal{D}^*)$ that achieves a specified
rate-distortion pair $(R, \Delta)$, it first transmits a header with the description of $\mathcal{D}^*$ and then
encodes $x^n$ using $\mathcal{E}^*$ (if it does not find such a pair, then $(R, \Delta)$ is not $M_n$-achievable in
the first place). The decoder, after decoding the index of the decoder $\mathcal{D}^*$, uses this decoder
to produce the reproduction $\hat{x}^n$ based on $y^n$ and the remaining part of the bitstream.
Obviously, the distortion associated with such an encoder is the same as that of $(\mathcal{E}^*, \mathcal{D}^*)$,
namely, less than or equal to $\Delta$. The rate is the same as that of $(\mathcal{E}^*, \mathcal{D}^*)$ (which means less
than or equal to $R$) plus the rate associated with the header. Thus, it remains to show that
the normalized redundancy associated with the header goes to zero as $n \to \infty$ whenever
$\theta < 1$. To this end, we now evaluate the number of bits necessary to describe the decoder
$\mathcal{D}^*$.

First, observe that for each $s'_i \in \mathcal{S}'$, $u_i$ may take values in a prefix code $\mathcal{C}(s'_i)$, which
can be described by a tree. As each such tree contains at most $\alpha$ leaves, and there are
$(k - 1)!$ different trees with $k$ leaves, the total number of possible trees is $K = \sum_{k=1}^{\alpha} (k - 1)!$;
thus the description of each such tree takes $\lceil \log K \rceil$ bits, and since such a tree should be
specified for every $s' \in \mathcal{S}'$, this takes $M_n \cdot \lceil \log K \rceil$ bits altogether. Next, the function $f'$
should be described. As there are no more than $\gamma^{M_n \alpha\beta}$ functions $\{f'\}$ for every possible
tree, the description of $f'$ takes $\lceil M_n \alpha\beta \log \gamma \rceil$ bits. Similarly, the description of $g'$ requires
at most $\lceil M_n \alpha\beta \log M_n \rceil$ bits. Thus, the total number of bits associated with the header is
$O(n^\theta \log n)$, which, when normalized by $n$ (to get the redundancy per source symbol),
tends to zero as $n \to \infty$, since $\theta$ is assumed strictly less than unity. This completes the
proof of Theorem 2. □
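
The header count in this proof can be checked numerically. The sketch below adds up the three contributions (prefix-code trees, the output function $f'$, and the next-state function $g'$) exactly as they are counted above and normalizes by $n$. It assumes base-2 logarithms and $M_n$ rounded to the nearest integer; the function names and the default alphabet sizes are illustrative assumptions.

```python
import math

def decoder_header_bits(M: int, alpha: int, beta: int, gamma: int) -> int:
    """Bits to describe an M-state decoder: trees, then f', then g', as counted in the proof."""
    K = sum(math.factorial(k - 1) for k in range(1, alpha + 1))  # number of trees, per the proof
    tree_bits = M * math.ceil(math.log2(K))
    f_bits = math.ceil(M * alpha * beta * math.log2(gamma))
    g_bits = math.ceil(M * alpha * beta * math.log2(M))
    return tree_bits + f_bits + g_bits

def normalized_header(n: int, theta: float, alpha: int = 2, beta: int = 2, gamma: int = 2) -> float:
    """Header bits per source symbol when M_n = round(n ** theta)."""
    M = max(1, round(n ** theta))
    return decoder_header_bits(M, alpha, beta, gamma) / n
```

With binary alphabets and $\theta = 0.5$, the normalized header is about 0.044 bits per symbol at $n = 2^{20}$ and about 0.002 at $n = 2^{30}$, consistent with the $O(n^{\theta-1}\log n)$ decay.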

For the converse part, we will make the additional assumptions that $\hat{\mathcal{X}} = \mathcal{X}$, that $\mathcal{X}$ is
a group with an addition operation (modulo $\alpha$), and that the distortion function $\rho(x, \hat{x})$
depends on $x$ and $\hat{x}$ only via their difference $x - \hat{x}$ (w.r.t. the group arithmetic). We will
then denote $\rho_0(x - \hat{x}) = \rho(x, \hat{x})$, where $\rho_0 : \mathcal{X} \to \mathbb{R}^+$.

Theorem 3 (Converse Theorem): Under the assumptions of the last paragraph, if $\theta > 1$
there exist pairs $(R, \Delta)$ that are not universally achievable.

Discussion. An alternative question with regard to universality w.r.t. FSM's with a growing
number of states, which is closer in spirit to the results of Section 3, could have been
the following: What is the critical growth rate of $M_n$ that still allows $\Delta_{M_n,d_{M_n}}(x^n, R)$ to
be achievable by a universal code for all $x^n$? This definition is seemingly stronger because
it appears to give rise to 'adaptation' of the distortion to the given $x^n$ rather than making
a commitment to a fixed distortion level $\Delta$, regardless of $x^n$. However, it is easy to
see that both our direct theorem and converse theorem are suitable for this definition too,
and therefore, so is the conclusion regarding the linear critical rate of $M_n$. The direct
part would be the same, but with $(\mathcal{E}^*, \mathcal{D}^*)$ being defined as the encoder-decoder pair that
achieves $\Delta_{M_n,d_{M_n}}(x^n, R)$, or alternatively, $R_{M_n,d_{M_n}}(x^n, \Delta)$, the $M_n$-state rate-distortion
function, defined in a dual manner. Regarding the converse, as we demonstrate in the proof
of Theorem 3 below, for every given $\theta > 1$, there is a pair $(R, \Delta)$ which is $M_n$-achievable for
certain sequences, and hence $\Delta_{M_n,d_{M_n}}(x^n, R) \le \Delta$, or, equivalently, $R_{M_n,d_{M_n}}(x^n, \Delta) \le R$.
But no single encoder performs even close to $R_{M_n,d_{M_n}}(x^n, \Delta)$ simultaneously for all these
sequences.

Proof of Theorem 3. Assume, again, that $M_n = n^\theta$. We will now show that for $\theta > 1$, there
exists a rate-distortion pair $(R, \Delta)$ which is not universally achievable. For a given $\Delta$, let

$$\phi(\Delta) = \max\{H(Z) :\ E\rho_0(Z) \le \Delta\}, \tag{28}$$

where $H(Z)$ is the entropy of an RV $Z$ taking on values in $\mathcal{X}$. Let $R^{\mathrm{WZ}}_{X|Y}(\Delta)$ denote the
W-Z rate-distortion function of the memoryless uniform source $X$ with side information $Y$
(generated by the given DMC $P(y|x)$) w.r.t. $\rho$. For a given $\Delta$, and a given $\theta > 1$, let

$$R = \frac{\phi(\Delta)}{\theta - 1} \tag{29}$$

and select $\Delta$ to be sufficiently small such that

$$R = \frac{\phi(\Delta)}{\theta - 1} < R^{\mathrm{WZ}}_{X|Y}(\Delta) \le R_X(\Delta) = \log \alpha - \phi(\Delta), \tag{30}$$

where $R_X(\Delta)$ is the ordinary rate-distortion function (without side information) of the
memoryless uniform source $X$ w.r.t. $\rho$. We wish to show that for such a choice of $R$ and $\Delta$,
there exists a set of sequences $\{x^n\}$ for each of which $(R, \Delta)$ is $M_n$-achievable, but on the
other hand, there is no single code that simultaneously achieves $(R, \Delta)$ for all sequences in
this set.
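
As an illustration of the quantities in (28) and (30), consider the special case of Hamming distortion, $\rho_0(z) = 1\{z \ne 0\}$, which is an assumption made here only for the example (the proof itself does not require it). In that case the maximum entropy under the constraint is the standard expression $\phi(\Delta) = h_2(\Delta) + \Delta\log(\alpha - 1)$ for $\Delta \le 1 - 1/\alpha$. The Python sketch below evaluates it, together with the rate $R$ implied by the relation $\theta = 1 + \phi(\Delta)/R$ and the bound $\log\alpha - \phi(\Delta)$; function names are hypothetical and logarithms are base 2.

```python
import math

def phi_hamming(delta: float, alpha: int) -> float:
    """max H(Z) over Z in {0, ..., alpha-1} subject to Pr{Z != 0} <= delta, in bits."""
    delta = min(delta, 1.0 - 1.0 / alpha)  # beyond this point the uniform law is feasible
    if delta <= 0.0:
        return 0.0
    h = -delta * math.log2(delta) - (1.0 - delta) * math.log2(1.0 - delta)
    return h + delta * math.log2(alpha - 1)

def rate_for_exponent(delta: float, alpha: int, theta: float):
    """Return (R, R_X) with R solving theta = 1 + phi/R and R_X = log2(alpha) - phi."""
    phi = phi_hamming(delta, alpha)
    return phi / (theta - 1.0), math.log2(alpha) - phi
```

For a binary alphabet, $\Delta = 0.01$ and $\theta = 1.5$, this gives $R \approx 0.16$ bits against $R_X(\Delta) \approx 0.92$ bits, so the requirement $R < R_X(\Delta)$ in (30) is comfortably met for small $\Delta$.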

For the above choice of $R$ and $\Delta$, consider a random process defined as follows. Let $m$
be the solution to the equation

$$2^{Rm} = \frac{n}{m}, \tag{31}$$

and assume that this solution is an integer. Further, let $\mathcal{F}$ be a set of $2^{mR}$ $m$-vectors, $\mathcal{F} =
\{u_1, \ldots, u_{2^{mR}}\}$, $u_i \in \mathcal{X}^m$, $i = 1, \ldots, 2^{mR}$. Assuming that $m$ divides $n$ (i.e., $2^{mR}$ is an integer),
let $x^n$ be formed by concatenating $n/m$ $m$-vectors $\{x^m(i),\ i = 1, \ldots, n/m\}$, where

$$x^m(i) = u^m(i) + z^m(i), \qquad i = 1, \ldots, n/m, \tag{32}$$

$u^m(i)$ is an arbitrary member of $\mathcal{F}$, and $z^m(i) \in \mathcal{X}^m$ is a vector of i.i.d. random variables,
each component of which is distributed according to the distribution $P^*$ on $\mathcal{X}$ which achieves
$\phi(\Delta)$. Now, we further assume that $\mathcal{F}$ is a good$^3$ code for the additive memoryless channel
$X = U + Z$, where $U$ designates the channel input at rate $R$, $Z \sim P^*$ is the memoryless
noise, and $X$ stands for the channel output. Since $R$ is assumed less than the capacity of
this channel, given by $C = R_X(\Delta) = \log \alpha - \phi(\Delta)$, such a good code exists. This
means that upon observing $x^m(i)$, one can identify $u^m(i)$ correctly with high probability,
provided that $m$ is large.

Consider next a block code of length $m$ operating on $x^n$ as follows: For every $i =
1, \ldots, n/m$, the encoder first decodes $u^m(i)$ from $x^m(i)$, and then transmits a description of
the decoded version, say $\hat{u}^m(i)$, using $\log |\mathcal{F}| = mR$ bits. The decoder, in turn, reconstructs
$\hat{x}^m(i) = \hat{u}^m(i)$ (without using the side information). Since $\hat{u}^m(i) = u^m(i)$ with high
probability, the distortion between $x^m(i)$ and $\hat{x}^m(i)$ is about $\Delta$, because the noise $z^m(i)$
is distributed according to $P^*$, whose $\rho_0$-moment does not exceed $\Delta$. This means that the
pair $(R, \Delta)$ is essentially achievable by a (non-universal) block code of size $m$.

Now, since the set of typical input $m$-vectors to the block code is of size that is of 
the exponential order of $2^{m[R+\phi(\Delta)]}$, this block code can be implemented by an FSM 
with essentially no more than $m 2^{m[R+\phi(\Delta)]}$ states. This can be done by constructing an 
(incomplete) $\alpha$-ary context-tree with $2^{m[R+\phi(\Delta)]}$ leaves, corresponding to the various typical 
sequences, where the state set is the set of nodes plus the leaves of this incomplete tree.$^4$ 

3 In the sense of small error probability. 
Thus, the number of states as a function of n is given by 

$$\begin{aligned}
M_n &\le m 2^{m[R+\phi(\Delta)]} \\
    &\le n^{1+\phi(\Delta)/R} \\
    &= n^{\theta}.
\end{aligned} \tag{33}$$

We have thus shown that, for the above choice of $R$ and $\Delta$, the pair $(R, \Delta)$ is 
$n^{\theta}$-achievable for every $x^n$ that is typical to the above defined process. 
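
The state count in eq. (33) can be checked numerically for concrete values. The following sketch is an illustration only: the helper names and the sample values of $R$ and $\phi(\Delta)$ are assumptions of this example, and logarithms are taken to base 2.

```python
def sequence_length(m: int, rate: float) -> float:
    """Sequence length n tied to the block length m through eq. (31): n = m * 2**(rate*m)."""
    return m * 2.0 ** (rate * m)


def fsm_state_bound(m: int, rate: float, phi: float) -> float:
    """State-count bound m * 2**(m*(rate + phi)) from the context-tree construction."""
    return m * 2.0 ** (m * (rate + phi))


def growth_exponent(rate: float, phi: float) -> float:
    """Exponent 1 + phi/rate of n appearing in eq. (33)."""
    return 1.0 + phi / rate


if __name__ == "__main__":
    rate, phi = 1.0, 0.5  # hypothetical values of R and phi(Delta)
    for m in (4, 8, 16):
        n = sequence_length(m, rate)
        states = fsm_state_bound(m, rate, phi)
        poly = n ** growth_exponent(rate, phi)
        print(m, n, states, poly, states <= poly)
```

Since $m 2^{m[R+\phi(\Delta)]} = n (n/m)^{\phi(\Delta)/R}$, the bound is in fact slightly smaller than $n^{1+\phi(\Delta)/R}$ for $m > 1$; the polynomial form is what matters for the growth rate of $M_n$.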

We now argue that no block-code of length $n$ can attain $(R, \Delta)$ simultaneously for all 
typical sequences of the process (32), and for all 'good' channel codes $\{\mathcal{F}\}$, which yield 
error probability less than $2^{-m[E_r(R)-\delta]}$, where $E_r(R)$ is the random coding error exponent 
[3] of the channel $X = U + Z$ w.r.t. the uniform random coding input distribution, and $\delta \in 
(0, E_r(R))$ is arbitrary.$^5$ Furthermore, we show that even 
$(R^{\mathrm{WZ}}_{X|Y}(\Delta) - \epsilon, \Delta)$, for arbitrarily 
small (but fixed) $\epsilon > 0$, cannot be simultaneously attained for all those sequences (recall 
that $R^{\mathrm{WZ}}_{X|Y}(\Delta) > R$). 

Let us denote the set of all the typical $u$-sequences by $\mathcal{G}$, i.e., $\mathcal{G}$ is the set of all sequences 
$\{u^n\}$ whose segments $\{u^m(i)\}$ form a code (for the channel $X = U + Z$) whose error 
probability, $P_e(u^n)$, is below $2^{-m[E_r(R)-\delta]}$, and observe that 

$$\Pr\{\mathcal{G}^c\} = \Pr\{u^n : P_e(u^n) \ge 2^{-m[E_r(R)-\delta]}\} \le \frac{E P_e(U^n)}{2^{-m[E_r(R)-\delta]}} \le \frac{2^{-mE_r(R)}}{2^{-m[E_r(R)-\delta]}} = 2^{-m\delta}, \tag{34}$$

where the first inequality follows from the Chebychev inequality. Now, assume conversely, 
that there exists a source code that does achieve $(R^{\mathrm{WZ}}_{X|Y}(\Delta) - \epsilon, \Delta)$ for all $x^n$ induced by all 
$u^n \in \mathcal{G}$ and all typical $z^n$-sequences, namely, sequences for which (a very high fraction of) 
the $m$-segments $\{z^m(i)\}$ are $P^*$-typical. Then, we have 

$$\frac{1}{|\mathcal{G}|} \sum_{u^n \in \mathcal{G}} E L(u^n + Z^n) \le n[R^{\mathrm{WZ}}_{X|Y}(\Delta) - \epsilon] \tag{35}$$

and 

$$\frac{1}{|\mathcal{G}|} \sum_{u^n \in \mathcal{G}} E\rho(u^n + Z^n, \hat{X}^n) \le n\Delta, \tag{36}$$

and so, for every $\lambda \ge 0$, 

$$\frac{1}{|\mathcal{G}|} \sum_{u^n \in \mathcal{G}} [E L(u^n + Z^n) + \lambda E\rho(u^n + Z^n, \hat{X}^n)] \le n[R^{\mathrm{WZ}}_{X|Y}(\Delta) - \epsilon + \lambda\Delta], \tag{37}$$

4 To be precise, one should add $m$ more states corresponding to a modulo-$m$ time counter, in order to 
idle for non-typical sequences (which are not terminated by the leaves) until the end of the block and then 
submit an error message. 

5 Such a code may harm the rate (beyond $R$) and the distortion (beyond $\Delta$) only for a fraction $2^{-m[E_r(R)-\delta]}$ 
of the $m$-segments. 

where the inner expectations are w.r.t. the uniform distribution over all typical $z$-sequences. 
Consider now a random selection of the $2^{Rm} = n/m$ members of $\mathcal{F}$ independently and with 
uniform distribution over $\mathcal{X}^m$. On the one hand, this induces the uniform distribution over 
$\mathcal{X}^n$, for which we know, by the converse to the W-Z rate-distortion theorem, that either 

$$\frac{1}{\alpha^n} \sum_{u^n} E\rho(u^n + Z^n, \hat{X}^n) > n\Delta, \tag{38}$$

or 

$$\frac{1}{\alpha^n} \sum_{u^n} E L(u^n + Z^n) \ge n R^{\mathrm{WZ}}_{X|Y}(\Delta), \tag{39}$$

where the inner expectations over $Z^n$ are as before (note that as $U^n$ is uniformly distributed 
then so is $X^n = U^n + Z^n$ regardless of $Z^n$, which is independent). It then follows that there 
exists $\lambda \ge 0$ (which is bounded independently of $n$) such that 

$$\frac{1}{\alpha^n} \sum_{u^n} [E L(u^n + Z^n) + \lambda E\rho(u^n + Z^n, \hat{X}^n)] \ge n[R^{\mathrm{WZ}}_{X|Y}(\Delta) + \lambda\Delta]. \tag{40}$$

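To see why this is true, let us denote $n\Delta' = \alpha^{-n} \sum_{u^n} E\rho(u^n + Z^n, \hat{X}^n)$, and then the 
left-hand side of eq. (40) is lower bounded by $n[R^{\mathrm{WZ}}_{X|Y}(\Delta') + \lambda\Delta']$. Now, if $\Delta' \le \Delta$, then eq. 
(40) clearly holds for $\lambda = 0$, since the Wyner-Ziv rate-distortion function is non-increasing. 
Else, if $\Delta' > \Delta$, it obviously holds for 

$$\lambda = \frac{R^{\mathrm{WZ}}_{X|Y}(\Delta) - R^{\mathrm{WZ}}_{X|Y}(\Delta')}{\Delta' - \Delta} \le \frac{R^{\mathrm{WZ}}_{X|Y}(0) - R^{\mathrm{WZ}}_{X|Y}(\Delta)}{\Delta} = \frac{H(X|Y) - R^{\mathrm{WZ}}_{X|Y}(\Delta)}{\Delta}, \tag{41}$$

where the inequality follows from the convexity of the Wyner-Ziv rate-distortion function 
[10], [1, p. 439, Lemma 14.9.1]. Therefore, in either case, the value of $\lambda$ that satisfies eq. 
(40) is bounded independently of $n$ (for $\Delta > 0$). For this value of $\lambda$, we then have 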

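$$\begin{aligned}
\sum_{u^n \in \mathcal{G}^c} [E L(u^n + Z^n) &+ \lambda E\rho(u^n + Z^n, \hat{X}^n)] \\
&= \sum_{u^n} [E L(u^n + Z^n) + \lambda E\rho(u^n + Z^n, \hat{X}^n)] - \sum_{u^n \in \mathcal{G}} [E L(u^n + Z^n) + \lambda E\rho(u^n + Z^n, \hat{X}^n)] \\
&\ge n\alpha^n [R^{\mathrm{WZ}}_{X|Y}(\Delta) + \lambda\Delta] - n|\mathcal{G}| [R^{\mathrm{WZ}}_{X|Y}(\Delta) - \epsilon + \lambda\Delta] \\
&\ge n\alpha^n [R^{\mathrm{WZ}}_{X|Y}(\Delta) + \lambda\Delta] - n\alpha^n [R^{\mathrm{WZ}}_{X|Y}(\Delta) - \epsilon + \lambda\Delta] \\
&= n\epsilon\alpha^n.
\end{aligned} \tag{42}$$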
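
The choice of $\lambda$ in eq. (41) can be sanity-checked on any convex, non-increasing rate-distortion curve. The sketch below uses $R(D) = 1 - h(D)$, the rate-distortion function of a binary symmetric source under Hamming distortion, purely as a stand-in; it is not the Wyner-Ziv function of the text, and all function names are assumptions of this example.

```python
import math


def binary_entropy(p: float) -> float:
    """Binary entropy h(p) in bits, with h(0) = h(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def rd_standin(d: float) -> float:
    """Stand-in curve R(D) = 1 - h(D) for D < 1/2 and 0 beyond (convex, non-increasing)."""
    return 1.0 - binary_entropy(d) if d < 0.5 else 0.0


def slope_multiplier(rd, delta: float, delta_prime: float) -> float:
    """lambda as chosen around eq. (41): 0 if delta' <= delta, else the chord slope."""
    if delta_prime <= delta:
        return 0.0
    return (rd(delta) - rd(delta_prime)) / (delta_prime - delta)


def slope_cap(rd, delta: float) -> float:
    """Upper bound (R(0) - R(delta)) / delta on lambda, valid for delta > 0."""
    return (rd(0.0) - rd(delta)) / delta


if __name__ == "__main__":
    delta, delta_prime = 0.1, 0.3
    lam = slope_multiplier(rd_standin, delta, delta_prime)
    # With this lambda, R(delta') + lam*delta' equals R(delta) + lam*delta.
    print(rd_standin(delta_prime) + lam * delta_prime, rd_standin(delta) + lam * delta)
    # Convexity keeps lambda below a cap that does not depend on delta'.
    print(lam, slope_cap(rd_standin, delta))
```

The cap depends only on $\Delta$, which is the sense in which $\lambda$ is bounded independently of $n$.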

On the other hand, assuming that $\max_x \rho_0(x) = \rho_{\max} < \infty$, and that $\max_{x^n} L(x^n) \le nL$ for 
some finite $L > 0$ (otherwise, long codewords would be better transmitted uncompressed 
plus an extra bit to tell that they are uncompressed), then we have 

$$\begin{aligned}
\sum_{u^n \in \mathcal{G}^c} [E L(u^n + Z^n) + \lambda E\rho(u^n + Z^n, \hat{X}^n)] &\le n|\mathcal{G}^c|(L + \lambda\rho_{\max}) \\
&= \Pr\{\mathcal{G}^c\} n\alpha^n (L + \lambda\rho_{\max}) \\
&\le n 2^{-m\delta} \alpha^n (L + \lambda\rho_{\max}).
\end{aligned} \tag{43}$$

Comparing now the right-most sides of eqs. (42) and (43), we get $\epsilon \le (L + \lambda\rho_{\max}) 2^{-m\delta}$, 
which is a contradiction for all large $n$ (and $m$) since $\epsilon > 0$ was assumed a constant (and 
$\lambda$ is bounded). Thus, we have disproved the existence of a code that achieves $(R, \Delta)$ 
simultaneously for all typical sequences of the process described above. This completes 
the proof. 
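
To make the final contradiction concrete, one can compute the smallest block length $m$ beyond which $(L + \lambda\rho_{\max}) 2^{-m\delta}$ drops below $\epsilon$. The sketch below is only illustrative: the function name and the numerical values are hypothetical and do not come from the proof.

```python
import math


def min_block_length_for_contradiction(
    eps: float, delta_exp: float, len_cap: float, lam: float, rho_max: float
) -> int:
    """Smallest integer m >= 1 with (len_cap + lam*rho_max) * 2**(-m*delta_exp) < eps."""
    if eps <= 0 or delta_exp <= 0:
        raise ValueError("eps and delta_exp must be positive")
    c = len_cap + lam * rho_max
    # c * 2**(-m*delta_exp) < eps  <=>  m > log2(c/eps) / delta_exp
    threshold = math.log2(c / eps) / delta_exp
    return max(1, math.floor(threshold) + 1)


if __name__ == "__main__":
    # Hypothetical constants: eps = 0.01, delta = 0.1, L = 8, lambda = 4, rho_max = 1.
    m_star = min_block_length_for_contradiction(0.01, 0.1, 8.0, 4.0, 1.0)
    print(m_star)
```

For every $m$ at or above the returned value the inequality $\epsilon \le (L + \lambda\rho_{\max}) 2^{-m\delta}$ fails, which is the contradiction invoked above.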

Appendix 

Extending the Results of Section 3 to Successive Refinement Codes 

Consider a two-stage coding scheme with a successive refinement structure. The first 
stage is as in Section 3. In the second stage, there is an additional finite-state encoder 
$\mathcal{E}'$ that transmits a refining description $v = (v_1, v_2, \ldots)$ of variable-length binary strings 
(similarly as $u$), and an additional finite-state decoder $\mathcal{D}'$ that has access to both $u$ and $v$ 
as well as to another SI sequence $z = (z_1, z_2, \ldots)$. It is assumed that 

$$P(y^n, z^n | x^n) = \prod_{i=1}^{n} P(y_i, z_i | x_i), \quad n = 1, 2, \ldots \tag{A.1}$$

More precisely, the description of the second stage is as follows: When the sequence $x$ 
is sequentially fed into a variable-rate finite-state encoder $\mathcal{E}' = (\mathcal{T}, p, q)$, this encoder 
generates an infinite sequence of binary strings of variable length, $v = (v_1, v_2, \ldots)$, while 
going through an infinite sequence of states $t_1, t_2, \ldots$ according to 

$$\begin{aligned}
v_i &= p(t_i, x_i), \\
t_{i+1} &= q(t_i, x_i), \quad i = 1, 2, \ldots
\end{aligned} \tag{A.2}$$

where the initial state $t_1$ is assumed to be a certain fixed member of $\mathcal{T}$. At the same time 
and in a similar manner, the second-stage finite-state decoder $\mathcal{D}' = (\mathcal{T}', p', q')$ sequentially 
maps $u$, $v$ and $z$ to an infinite reproduction sequence $\hat{x}_1, \hat{x}_2, \ldots$, using the recursion 

$$\begin{aligned}
\hat{x}_{i-d'} &= p'(t'_i, u_i, v_i, z_i), \quad i = d'+1, d'+2, \ldots \\
t'_{i+1} &= q'(t'_i, u_i, v_i, z_i), \quad i = 1, 2, \ldots
\end{aligned} \tag{A.3}$$

where $d'$ (positive integer) is the second-stage encoding-decoding delay, and the initial state 
$t'_1$ is assumed to be a certain fixed member of $\mathcal{T}'$. Similarly to the model of variable-rate 
coding of the first stage, it is assumed that at each time instant $i$, when the decoder is at 
state $t'_i$ and reads$^6$ the current first-stage codeword $u_i$, it is able to isolate the current input 
codeword $v_i$ from the following codewords of the compressed bitstream, $v_{i+1}, v_{i+2}, \ldots$. To 
this end, we allow a prefix code $\mathcal{C}'(t', u)$ associated with each $t' \in \mathcal{T}'$ and $u \in \mathcal{C}(s')$. For 
every $v \in \mathcal{C}'(t', u)$, let $L'(v)$ denote the length of $v$ (in bits). We then assume that the Kraft 
inequality 

$$\sum_{v \in \mathcal{C}'(t', u)} 2^{-L'(v)} \le 1 \tag{A.4}$$

holds for all $t' \in \mathcal{T}'$, $u \in \mathcal{C}(s')$. Again, this discussion continues to apply when single 
codewords are replaced by $\ell$-vectors, $v^\ell$, formed by concatenating $\ell$ (legitimate) codewords 
successively. In this case, let $[\mathcal{C}']^\ell(t', u^\ell)$ denote the supercode formed by all $\{v^\ell\}$ that 
originate from $(t', u^\ell)$. Clearly, since the components of $v^\ell$ can be identified recursively, 
the supercode satisfies the Kraft inequality as well w.r.t. the length function 
$L'(v^\ell) = \sum_{i=1}^{\ell} L'(v_i)$. 
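
The encoder recursion (A.2) and the Kraft condition (A.4) can be illustrated with a toy example. The sketch below is a simplification under stated assumptions: the two-state encoder tables `p` and `q` are hypothetical, and the supercode is built from a single prefix code used in every state, whereas in the text the code may depend on $(t', u)$.

```python
from itertools import product


def run_encoder(x, p, q, t1):
    """Apply v_i = p(t_i, x_i), t_{i+1} = q(t_i, x_i) and return the codewords v_1, v_2, ..."""
    t, out = t1, []
    for xi in x:
        out.append(p[(t, xi)])
        t = q[(t, xi)]
    return out


def kraft_sum(codewords):
    """Sum of 2**(-length) over all codewords; at most 1 for any prefix code."""
    return sum(2.0 ** -len(w) for w in codewords)


def is_prefix_free(codewords):
    """True if no codeword is a prefix of another (checked on lexicographic neighbours)."""
    ws = sorted(codewords)
    return all(not ws[i + 1].startswith(ws[i]) for i in range(len(ws) - 1))


def supercode(codewords, ell):
    """All concatenations of ell codewords, with length function sum of the L'(v_i)."""
    return ["".join(t) for t in product(codewords, repeat=ell)]


if __name__ == "__main__":
    # Hypothetical two-state encoder over the input alphabet {"a", "b"}.
    p = {(0, "a"): "0", (0, "b"): "10", (1, "a"): "11", (1, "b"): "0"}
    q = {(0, "a"): 0, (0, "b"): 1, (1, "a"): 0, (1, "b"): 1}
    print(run_encoder("abba", p, q, t1=0))

    code = ["0", "10", "11"]
    print(kraft_sum(code), is_prefix_free(code))
    print(kraft_sum(supercode(code, 2)), is_prefix_free(supercode(code, 2)))
```

In this toy case the Kraft sum of the supercode is the product of the per-symbol Kraft sums, which is why it cannot exceed 1 when each component code satisfies (A.4).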

Let $\Delta'(x^n, \mathcal{E}', \mathcal{D}')$ denote the expected distortion $\frac{1}{n} E\rho'(x^n, \hat{X}^n) = \frac{1}{n} \sum_{i=1}^{n} E\rho'(x_i, \hat{X}_i)$ 
associated with the second-stage encoding and decoding of $x^n$ by $(\mathcal{E}, \mathcal{D})$ and $(\mathcal{E}', \mathcal{D}')$, where 
the expectation is w.r.t. the DMC's. For a given $x^n$ and rate pair $(R, \Delta R)$, a distortion 
pair $(D_1, D_2)$ is said to be $(M, d, d')$-achievable if there exist encoder-decoders $(\mathcal{E}, \mathcal{D})$ and 
$(\mathcal{E}', \mathcal{D}')$, having no more than $M$ states each, with first-stage delay not exceeding $d$, 
second-stage delay not exceeding $d'$, and which satisfy the rate constraints: 

$$\begin{aligned}
\sum_{i=1}^{n} L(u_i) &\le nR \\
\sum_{i=1}^{n} L'(v_i) &\le n\Delta R
\end{aligned} \tag{A.5}$$

and 

$$\begin{aligned}
\Delta(x^n, \mathcal{E}, \mathcal{D}) &\le D_1 \\
\Delta'(x^n, \mathcal{E}', \mathcal{D}') &\le D_2
\end{aligned} \tag{A.6}$$

6 The second stage decoder can keep a copy of $s'_i$ of footnote no. 2 as it has access to $\{u_i\}$ as well. 

Let $\mathcal{A}_{M,d,d'}(x^n, R, \Delta R)$ denote the set of distortion pairs $(D_1, D_2)$ that are 
$(M, d, d')$-achievable for $x^n$. 

While the definition of $\mathcal{A}_{M,d,d'}(x^n, R, \Delta R)$ is operational, we next define an informational 
achievable region $\mathcal{A}_{X^\ell|Y^\ell,Z^\ell}(R, \Delta R)$ (with $X^\ell$ being distributed according to the empirical 
distribution of $\ell$-vectors) as follows: $(D_1, D_2) \in \mathcal{A}_{X^\ell|Y^\ell,Z^\ell}(R, \Delta R)$ iff there exist random 
variables $U$, $V$ that satisfy the Markov relation $(U, V) \to X^\ell \to (Y^\ell, Z^\ell)$ and functions $h$ 
and $h'$ such that 

$$\begin{aligned}
E\rho(X^\ell, h(Y^\ell, U)) &\le \ell D_1 \\
E\rho'(X^\ell, h'(Z^\ell, U, V)) &\le \ell D_2 \\
H(U) &\le \ell R \\
H(V|U) &\le \ell \Delta R
\end{aligned} \tag{A.7}$$

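In the theorem below, which is an extension of Theorem 1, $\mathcal{A}_{X^\ell|Y^\ell,Z^\ell}(R, \Delta R) - (\delta, \delta')$ means 
the set $\{(D_1 - \delta, D_2 - \delta') : (D_1, D_2) \in \mathcal{A}_{X^\ell|Y^\ell,Z^\ell}(R, \Delta R)\}$. 

Theorem 4: For all positive integers $n$ and $\ell$ such that $\ell$ divides $n$: 

$$\mathcal{A}_{M,d,d'}(x^n, R, \Delta R) \subseteq \mathcal{A}_{X^\ell|Y^\ell,Z^\ell}\left(R + \frac{2\log M}{\ell},\ \Delta R + \frac{2\log M}{\ell}\right) - \frac{1}{\ell}(\rho_{\max} d,\ \rho'_{\max} d'), \tag{A.8}$$

where $\rho'_{\max} = \max_{x,\hat{x}} \rho'(x, \hat{x})$. 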
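
The rate and distortion penalties appearing in (A.8) are easy to tabulate. The sketch below is a minimal illustration, assuming base-2 logarithms (rates in bits); the function name and the sample parameter values are hypothetical.

```python
import math


def theorem4_slack(M: int, ell: int, d: int, d_prime: int,
                   rho_max: float, rho_prime_max: float):
    """Return (rate slack, first-stage distortion slack, second-stage distortion slack).

    The rate slack 2*log2(M)/ell is added to both R and Delta-R in (A.8);
    the distortion slacks rho_max*d/ell and rho'_max*d'/ell are subtracted
    from D_1 and D_2, respectively.
    """
    if M < 1 or ell < 1:
        raise ValueError("M and ell must be positive integers")
    rate_slack = 2.0 * math.log2(M) / ell
    return rate_slack, rho_max * d / ell, rho_prime_max * d_prime / ell


if __name__ == "__main__":
    # Hypothetical parameters: M = 16 states, delays d = 2 and d' = 4.
    for ell in (64, 256, 1024):
        print(ell, theorem4_slack(16, ell, 2, 4, 1.0, 0.5))
```

All three terms vanish as $\ell$ grows with $M$, $d$ and $d'$ held fixed, which matches the $(\log M)/\ell$ order of the loss.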

Proof. The first stage is as before. As for the second stage, similarly to (18), (19) and 
(21), we have 

$$\begin{aligned}
\ell \Delta R &\ge H(V^\ell | U^\ell, T') \\
&\ge H(V^\ell, T' | U^\ell, S, S', T') \\
&\ge H(V^\ell, T, T' | U^\ell, S, S') - 2\log M.
\end{aligned} \tag{A.9}$$

Also, $(U^\ell, V^\ell) \to (S, S', T, T') \to (Y^\ell, Z^\ell)$ is a Markov chain. Again, since $(S, S', T, T') \to 
X^\ell \to (Y^\ell, Z^\ell)$ is also a Markov chain, then so is $(U^\ell, S, S', V^\ell, T, T') \to X^\ell \to (Y^\ell, Z^\ell)$. 
The reconstructed output $\hat{X}^{\ell-d'}$ is a function of $(U^\ell, V^\ell, T', Z^\ell)$, which is a special case 
of a function of $(U^\ell, S, S', V^\ell, T, T', Z^\ell)$, and the distortion at the second stage is then 
lower bounded by $\frac{1}{\ell} E\rho'(X^\ell, \hat{X}^\ell) - \rho'_{\max} d'/\ell$ as before. Hence, defining $U = (U^\ell, S, S')$ 
and $V = (V^\ell, T, T')$, we have found RV's $(U, V)$ and functions $h$ and $h'$ that satisfy the 
conditions for $(D_1 - \rho_{\max} d/\ell, D_2 - \rho'_{\max} d'/\ell)$ being in $\mathcal{A}_{X^\ell|Y^\ell,Z^\ell}(R + 2\log M/\ell, \Delta R + 2\log M/\ell)$ 
provided that $(D_1, D_2) \in \mathcal{A}_{M,d,d'}(x^n, R, \Delta R)$. □ 

The achievability is conceptually simple. Again, the first stage is as before. The second 
stage is a conditional version of the first where both encoder and decoder have access to 
the already decoded U. 

Finally, a few comments are in order, in addition to the comments made at the Discussion 
of Section 3: 

1. While in [8] there is no apparent way to generalize the results to more than two stages 
(unless the SI's are identical), here the extension is straightforward. 

2. Unlike in [8], here there is no need for the Markov structure $X \to Z \to Y$. 

3. The alphabet sizes required for $U$ and $V$ are $|\mathcal{U}| \le \alpha^\ell + 3$ and $|\mathcal{V}| \le \alpha^\ell \cdot |\mathcal{U}| + 1$. 